acoustic signal
Achieving Effective Virtual Reality Interactions via Acoustic Gesture Recognition based on Large Language Models
Zhang, Xijie, He, Fengliang, Dai, Hong-Ning
Natural and efficient interaction remains a critical challenge for virtual reality and augmented reality (VR/AR) systems. Vision-based gesture recognition suffers from high computational cost, sensitivity to lighting conditions, and privacy leakage concerns. Acoustic sensing provides an attractive alternative: by emitting inaudible high-frequency signals and capturing their reflections, the channel impulse response (CIR) encodes how gestures perturb the acoustic field in a low-cost and user-transparent manner. However, existing CIR-based gesture recognition methods often rely on extensively training models on large labeled datasets, making them unsuitable for few-shot VR scenarios. In this work, we propose the first framework that leverages large language models (LLMs) for CIR-based gesture recognition in VR/AR systems. Despite LLMs' strengths, achieving few-shot and zero-shot learning of CIR gestures is non-trivial because gesture-induced CIR features are inconspicuous. To tackle this challenge, we collect differential CIR rather than raw CIR data. Moreover, we construct a real-world dataset collected from 10 participants performing 15 gestures across three categories (digits, letters, and shapes), with 10 repetitions each. We then conduct extensive experiments on this dataset using an LLM-based classifier. Results show that our LLM-based framework achieves accuracy comparable to classical machine learning baselines while requiring no domain-specific retraining.
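The paper's key preprocessing step, differencing consecutive CIR frames so that static reflections cancel and only gesture-induced perturbations remain, is straightforward to reproduce. A minimal sketch, assuming CIR magnitudes arrive as a (frames × taps) array; the text serialization for the LLM prompt is an illustrative choice of ours, not the authors' format:

```python
import numpy as np

def differential_cir(cir: np.ndarray) -> np.ndarray:
    """Frame-to-frame difference of CIR magnitudes.

    cir: array of shape (n_frames, n_taps) holding |CIR| per frame.
    Static reflections cancel in the difference, leaving only the
    gesture-induced changes in the acoustic field.
    """
    return np.diff(np.abs(cir), axis=0)

def to_prompt(dcir: np.ndarray, decimals: int = 3) -> str:
    """Serialize a differential-CIR snippet for a few-shot LLM prompt.
    (Illustrative encoding; the paper does not specify its format.)"""
    rows = [" ".join(f"{v:.{decimals}f}" for v in frame) for frame in dcir]
    return "\n".join(rows)

# Example: 50 frames x 16 taps of synthetic CIR magnitudes
rng = np.random.default_rng(0)
cir = rng.random((50, 16))
print(to_prompt(differential_cir(cir)[:2]))
```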
- Asia > China > Hong Kong (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada (0.04)
- Education (0.68)
- Health & Medicine (0.46)
WaveVerif: Acoustic Side-Channel based Verification of Robotic Workflows
Erdogan, Zeynep Yasemin, Nagaraja, Shishir, Ahmed, Chuadhry Mujeeb, Shah, Ryan
In this paper, we present a framework that uses acoustic side-channel analysis (ASCA) to monitor and verify whether a robot correctly executes its intended commands. We develop and evaluate a machine-learning-based workflow verification system that uses acoustic emissions generated by robotic movements. The system can determine whether real-time behavior is consistent with expected commands. The evaluation takes into account movement speed, direction, and microphone distance. The results show that individual robot movements can be validated with over 80% accuracy under baseline conditions using four different classifiers: Support Vector Machine (SVM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). Additionally, workflows such as pick-and-place and packing could be identified with similarly high confidence. Our findings demonstrate that acoustic signals can support real-time, low-cost, passive verification in sensitive robotic environments without requiring hardware modifications.
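As a rough illustration of the verification stage, here is a minimal sketch of one of the four evaluated classifiers (the SVM) trained on features of the acoustic emissions; the MFCC features and hyperparameters are our assumptions, since the abstract does not specify them:

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def movement_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Mean MFCC vector for one recorded acoustic emission."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)  # summarize frames into one feature vector

# X: one feature row per recording; y: movement label (e.g., "joint1_cw").
# clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
# clf.fit(X_train, y_train)
# print("movement accuracy:", clf.score(X_test, y_test))
```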
- Asia > Middle East > Republic of Türkiye (0.41)
- Europe > United Kingdom (0.40)
- North America > United States > District of Columbia > Washington (0.05)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Energy (1.00)
- Government > Regional Government > Asia Government > Middle East Government > Republic of Türkiye Government (0.41)
Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment
Hoang, Long-Vu, Nguyen, Tuan, Dat, Tran Huy
This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding the object's structural and material properties. By emitting acoustic stimuli and capturing the signals scattered from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, with significant potential for applications across industries.
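The best-performing strategy, fine-tuning all parameters of a self-supervised model, follows a standard recipe. A hedged sketch using torchaudio's wav2vec 2.0 bundle as a stand-in backbone (the abstract does not name the actual model); `n_classes` is a hypothetical label count:

```python
import torch
import torch.nn as nn
import torchaudio

# Stand-in self-supervised backbone; the paper's actual model is unspecified.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
encoder = bundle.get_model()

n_classes = 4  # hypothetical number of hair type/moisture classes
head = nn.Linear(768, n_classes)  # WAV2VEC2_BASE feature dim is 768

# Strategy (iv): unfreeze *all* encoder parameters, not just the head.
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

def train_step(waveforms: torch.Tensor, labels: torch.Tensor) -> float:
    feats, _ = encoder.extract_features(waveforms)  # per-layer outputs
    pooled = feats[-1].mean(dim=1)                  # mean-pool over time
    loss = loss_fn(head(pooled), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```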
AquaSignal: An Integrated Framework for Robust Underwater Acoustic Analysis
Panteli, Eirini, Santos, Paulo E., Humphrey, Nabil
This paper presents AquaSignal, a modular and scalable pipeline for preprocessing, denoising, classification, and novelty detection of underwater acoustic signals. Designed to operate effectively in noisy and dynamic marine environments, AquaSignal integrates state-of-the-art deep learning architectures to enhance the reliability and accuracy of acoustic signal analysis. The system is evaluated on a combined dataset from the DeepShip and Ocean Networks Canada (ONC) benchmarks, providing a diverse set of real-world underwater scenarios. AquaSignal employs a U-Net architecture for denoising, a ResNet18 convolutional neural network for classifying known acoustic events, and an autoencoder-based model for unsupervised detection of novel or anomalous signals. To our knowledge, this is the first comprehensive study to apply and evaluate this combination of techniques on maritime vessel acoustic data. Experimental results show that AquaSignal improves signal clarity and task performance, achieving 71% classification accuracy and 91% accuracy in novelty detection. Although classification performance is slightly lower than that of some state-of-the-art models, differences in data partitioning strategies limit direct comparison. Overall, AquaSignal demonstrates strong potential for real-time underwater acoustic monitoring in scientific, environmental, and maritime domains.
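Of the three stages, the unsupervised novelty detector is the simplest to sketch: an autoencoder trained only on known-class inputs flags signals whose reconstruction error exceeds a threshold chosen on validation data. A minimal version under our own assumptions about input shape (the abstract gives no architecture details):

```python
import torch
import torch.nn as nn

class SpectrogramAE(nn.Module):
    """Tiny autoencoder over flattened spectrogram patches (assumed 64x64)."""
    def __init__(self, n_in: int = 64 * 64, n_latent: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(),
                                 nn.Linear(256, n_latent))
        self.dec = nn.Sequential(nn.Linear(n_latent, 256), nn.ReLU(),
                                 nn.Linear(256, n_in))

    def forward(self, x):
        return self.dec(self.enc(x))

@torch.no_grad()
def is_novel(model: SpectrogramAE, x: torch.Tensor,
             threshold: float) -> torch.Tensor:
    """Flag samples whose per-sample MSE exceeds a validation threshold."""
    err = ((model(x) - x) ** 2).mean(dim=1)
    return err > threshold
```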
- North America > Canada (0.25)
- North America > United States (0.14)
- Europe > Netherlands > South Holland > Rotterdam (0.14)
- Transportation > Marine (1.00)
- Energy (1.00)
- Government > Military (0.68)
A multi-head deep fusion model for recognition of cattle foraging events using sound and movement signals
Ferrero, Mariano, Chelotti, José Omar, Martinez-Rau, Luciano Sebastián, Vignolo, Leandro, Pires, Martín, Galli, Julio Ricardo, Giovanini, Leonardo Luis, Rufiner, Hugo Leonardo
Monitoring feeding behaviour is a relevant task for efficient herd management and the effective use of available resources in grazing cattle. Automatically recognising animals' feeding activities through the identification of specific jaw movements enables improved diet formulation and early detection of metabolic problems and symptoms of animal discomfort, among other benefits. The use of sensors to obtain signals for such monitoring has become popular in the last two decades. The most frequently employed sensors include accelerometers, microphones, and cameras, each with its own advantages and drawbacks. A largely unexplored aspect is the simultaneous use of multiple sensors, combining their signals to enhance the precision of the estimates. In this direction, this work introduces a deep neural network based on the fusion of acoustic and inertial signals, composed of convolutional, recurrent, and dense layers. The main advantage of this model is that it combines the signals by automatically extracting features from each of them independently. The model emerged from an exploration and comparison of different neural network architectures proposed in this work, which perform information fusion at different levels; feature-level fusion outperformed data-level and decision-level fusion by at least 0.14 in F1-score. Moreover, a comparison with state-of-the-art machine learning methods is presented, including traditional and deep learning approaches. The proposed model yielded an F1-score of 0.802, a 14% increase over previous methods. Finally, results from an ablation study and a post-training quantization evaluation are also reported.
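The winning feature-level fusion design can be pictured as one convolutional branch per modality whose outputs are concatenated before the dense classifier. A schematic PyTorch sketch under assumed input shapes and layer sizes (the abstract does not specify them):

```python
import torch
import torch.nn as nn

class FeatureFusionNet(nn.Module):
    """Per-modality conv branches -> concatenation -> dense classifier.
    Input lengths (1 s windows at assumed rates) are illustrative only."""
    def __init__(self, n_classes: int = 3):
        super().__init__()
        # Acoustic branch: 1-channel waveform input.
        self.audio = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten())
        # Inertial branch: 3-axis accelerometer input.
        self.imu = nn.Sequential(
            nn.Conv1d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(16 * 32 * 2, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, audio, imu):
        # Feature-level fusion: concatenate learned features per modality.
        z = torch.cat([self.audio(audio), self.imu(imu)], dim=1)
        return self.head(z)

# logits = FeatureFusionNet()(torch.randn(8, 1, 16000), torch.randn(8, 3, 100))
```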
- South America > Argentina (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Europe > Sweden (0.04)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.46)
- Health & Medicine (1.00)
- Food & Agriculture > Agriculture (1.00)
Indoor Drone Localization and Tracking Based on Acoustic Inertial Measurement
Sun, Yimiao, Wang, Weiguo, Mottola, Luca, Jia, Zhang, Wang, Ruijin, He, Yuan
We present Acoustic Inertial Measurement (AIM), a one-of-a-kind technique for indoor drone localization and tracking. Indoor drone localization and tracking remain a crucial yet unsolved challenge: in GPS-denied environments, existing approaches have limited applicability, especially in Non-Line of Sight (NLoS) conditions, require extensive environment instrumentation, or demand considerable hardware/software changes on drones. In contrast, AIM exploits the acoustic characteristics of drones to estimate their location and derive their motion, even in NLoS settings. We tame location estimation errors using a dedicated Kalman filter and the Interquartile Range (IQR) rule, and demonstrate that AIM can support indoor spaces with arbitrary ranges and layouts. We implement AIM using an off-the-shelf microphone array and evaluate its performance with a commercial drone under varied settings. Results indicate that the mean localization error of AIM is 46% lower than that of commercial UWB-based systems in a complex 10 m × 10 m indoor scenario, where state-of-the-art infrared systems would not even work because of NLoS conditions. When distributed microphone arrays are deployed, the mean error can be reduced to less than 0.5 m over a 20 m range.
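The error-taming step named in the abstract, IQR-based outlier rejection followed by Kalman filtering, can be sketched with a constant-velocity model; the noise covariances below are illustrative choices of ours, not the paper's tuned values:

```python
import numpy as np

def iqr_filter(points: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Drop location estimates outside [Q1 - k*IQR, Q3 + k*IQR] per axis."""
    q1, q3 = np.percentile(points, [25, 75], axis=0)
    iqr = q3 - q1
    mask = np.all((points >= q1 - k * iqr) & (points <= q3 + k * iqr), axis=1)
    return points[mask]

def kalman_track(zs: np.ndarray, dt: float = 0.1) -> np.ndarray:
    """Constant-velocity Kalman filter over 2D position measurements."""
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                  [0, 0, 1, 0], [0, 0, 0, 1]], float)  # state transition
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)  # measure position only
    Q, R = np.eye(4) * 1e-3, np.eye(2) * 0.25          # assumed noise levels
    x, P = np.zeros(4), np.eye(4)
    track = []
    for z in zs:
        x, P = F @ x, F @ P @ F.T + Q                  # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
        x = x + K @ (z - H @ x)                        # update
        P = (np.eye(4) - K @ H) @ P
        track.append(x[:2].copy())
    return np.array(track)
```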
- Europe > Sweden > Uppsala County > Uppsala (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
- Asia > China > Sichuan Province > Chengdu (0.04)
BGM2Pose: Active 3D Human Pose Estimation with Non-Stationary Sounds
Shibata, Yuto, Oumi, Yusuke, Irie, Go, Kimura, Akisato, Aoki, Yoshimitsu, Isogawa, Mariko
We propose BGM2Pose, a non-invasive 3D human pose estimation method that uses arbitrary music (e.g., background music) as the active sensing signal. Unlike existing approaches, which significantly limit practicality by employing intrusive chirp signals within the audible range, our method utilizes natural music that causes minimal discomfort to humans. Estimating human poses from ordinary music presents significant challenges: in contrast to sound sources specifically designed for measurement, regular music varies in both volume and pitch, and these dynamic signal changes are inevitably mixed with the alterations in the sound field caused by human motion, making it hard to extract reliable cues for pose estimation. To address these challenges, BGM2Pose introduces a Contrastive Pose Extraction Module that employs contrastive learning and hard negative sampling to eliminate musical components from the recorded data, isolating the pose information. Additionally, we propose a Frequency-wise Attention Module that enables the model to focus on subtle acoustic variations attributable to human movement by dynamically computing attention across frequency bands. Experiments suggest that our method outperforms existing methods, demonstrating substantial potential for real-world applications. Our datasets and code will be made publicly available.
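The Contrastive Pose Extraction Module is described only at a high level, but contrastive learning with hard negatives is commonly implemented as an InfoNCE-style loss. A generic sketch under assumed embedding shapes, with music-only recordings as one plausible source of hard negatives:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             negatives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE loss: pull pose-related embeddings together, push away
    hard negatives (e.g., music-only recordings of the same track).

    anchor, positive: (B, D); negatives: (B, N, D). Shapes and the
    sampling scheme are our assumptions, not the paper's specification.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    pos = (a * p).sum(-1, keepdim=True) / tau        # (B, 1)
    neg = torch.einsum("bd,bnd->bn", a, n) / tau     # (B, N)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(a), dtype=torch.long)   # positive is index 0
    return F.cross_entropy(logits, labels)
```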
- Information Technology (0.68)
- Leisure & Entertainment (0.66)
- Media > Music (0.48)
Software implemented fault diagnosis of natural gas pumping unit based on feedforward neural network
Kozlenko, Mykola, Zamikhovska, Olena, Zamikhovskyi, Leonid
In recent years, increasing attention has been paid to the use of artificial neural networks (ANNs) for the diagnostics of gas pumping units (GPUs). Usually, ANN training is carried out on models of GPU workflows, with generated sets of diagnostic data used to simulate defect conditions; the results obtained do not allow assessing the real state of the GPU. We propose instead to use the characteristics of the GPU's acoustic and vibration processes as the ANN's input data. A descriptive statistical analysis of real vibration and acoustic processes generated by the operation of a GPU of type GTK-25-i (Nuovo Pignone, Italy) was carried out, and packets of diagnostic features arriving at the ANN input were formed. The diagnostic features are the five maximum-amplitude components of the acoustic and vibration signals, together with the standard deviation of each sample. They are computed directly in the ANN's input data pipeline in real time for three technical states of the GPU. Using TensorFlow, Keras, NumPy, and pandas in Python 3, we developed a deep, fully connected feedforward ANN trained with the error-backpropagation algorithm. The results of training and testing the developed ANN are presented. During testing over all 1475 signal samples, classification precision was 1.0000 for the "nominal" state, 0.9853 for the "current" state, and 0.9091 for the "defective" state. The developed ANN classifies the technical states of the GPU with accuracy sufficient for practical use, helping to prevent GPU failures, and can be used to diagnose GPUs of any type and power rating.
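The described feature vector (the five maximum-amplitude spectral components plus the sample's standard deviation) and a fully connected Keras classifier are easy to sketch; the layer sizes below are our assumptions, as the exact architecture is not given here:

```python
import numpy as np
import tensorflow as tf

def diagnostic_features(signal: np.ndarray) -> np.ndarray:
    """Five largest FFT magnitude components plus the sample's std."""
    mag = np.abs(np.fft.rfft(signal))
    top5 = np.sort(mag)[-5:][::-1]
    return np.concatenate([top5, [signal.std()]])  # 6 features per sample

# Deep fully connected feedforward net over the 6-feature vectors,
# trained by backpropagation; 3 outputs for nominal/current/defective.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(6,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```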
- Europe > Italy (0.24)
- Europe > Ukraine (0.14)
- North America > United States > California (0.14)
SonicBoom: Contact Localization Using Array of Microphones
Lee, Moonyoung, Yoo, Uksang, Oh, Jean, Ichnowski, Jeffrey, Kantor, George, Kroemer, Oliver
In cluttered environments where visual sensors encounter heavy occlusion, such as in agricultural settings, tactile signals can provide crucial spatial information for the robot to locate rigid objects and maneuver around them. We introduce SonicBoom, a holistic hardware and learning pipeline that enables contact localization through an array of contact microphones. While conventional sound source localization methods effectively triangulate sources in air, localization through solid media with irregular geometry and structure presents challenges that are difficult to model analytically. We address this challenge through a feature-engineering and learning-based approach, autonomously collecting 18,000 robot interaction sound pairs to learn a mapping between acoustic signals and collision locations on the robot end-effector link. By leveraging relative features between microphones, SonicBoom achieves localization errors of 0.42cm for in-distribution interactions and maintains robust performance of 2.22cm error even with novel objects and contact conditions. We demonstrate the system's practical utility through haptic mapping of occluded branches in mock canopy settings, showing that acoustics-based sensing can enable reliable robot navigation in visually challenging environments.
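One plausible reading of "relative features between microphones" is pairwise cross-correlation lags (TDOA-style cues) fed to a learned regressor; the sketch below is our interpretation, not the authors' exact feature set:

```python
import numpy as np
from itertools import combinations
from sklearn.neural_network import MLPRegressor

def relative_lag_features(mics: np.ndarray) -> np.ndarray:
    """Peak cross-correlation lag for every microphone pair.

    mics: (n_mics, n_samples) recording of one tap from the contact-mic
    array. Relative lags depend on where the collision occurred, so they
    serve as location cues even in irregular solid media.
    """
    feats = []
    for i, j in combinations(range(len(mics)), 2):
        xc = np.correlate(mics[i], mics[j], mode="full")
        feats.append(np.argmax(xc) - (len(mics[j]) - 1))  # signed lag
    return np.array(feats, float)

# X: one lag-feature row per tap; y: (x, y, z) contact point on the link.
# reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X, y)
```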
- Europe > Spain > Galicia > Madrid (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Acoustic-based 3D Human Pose Estimation Robust to Human Position
Oumi, Yusuke, Shibata, Yuto, Irie, Go, Kimura, Akisato, Aoki, Yoshimitsu, Isogawa, Mariko
This paper explores the problem of 3D human pose estimation from only low-level acoustic signals. The existing active acoustic sensing-based approach to 3D human pose estimation implicitly assumes that the target user is positioned along a line between the loudspeakers and a microphone. Because sound reflected or diffracted by the human body causes only subtle acoustic signal changes compared to direct obstruction, the existing model's accuracy degrades significantly when subjects deviate from this line, limiting its practicality in real-world scenarios. To overcome this limitation, we propose a novel method composed of a position discriminator and a reverberation-resistant model. The former predicts the standing positions of subjects and applies adversarial learning to extract position-invariant features. The latter uses acoustic signals recorded before the estimation target time as references to enhance robustness against variations in sound arrival times due to diffraction and reflection. We construct an acoustic pose estimation dataset that covers diverse human locations and demonstrate through experiments that our proposed method outperforms existing approaches.
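Adversarial extraction of position-invariant features is typically implemented with a gradient reversal layer between the feature extractor and the position discriminator. A generic sketch of that pattern (the paper may use a different adversarial formulation):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients on the way back,
    so minimizing the discriminator loss *maximizes* position confusion
    in the upstream feature extractor."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def adversarial_losses(features, pose_head, pos_disc,
                       pose_gt, pos_gt, lam: float = 0.1):
    """Pose regression loss plus reversed position-classification loss."""
    pose_loss = nn.functional.mse_loss(pose_head(features), pose_gt)
    pos_logits = pos_disc(GradReverse.apply(features, lam))
    pos_loss = nn.functional.cross_entropy(pos_logits, pos_gt)
    return pose_loss + pos_loss
```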